
Based on our assumption, for $w_i$ we formulate the ideal bimodal distribution as

$$P(w_i \mid \Theta_i) = \sum_{k=1}^{2} \beta_i^k \, p(w_i \mid \Theta_i^k), \tag{6.49}$$

where the number of distributions is set to 2 in this paper. $\Theta_i^k = \{\mu_i^k, \sigma_i^k\}$ denotes the parameters of the $k$-th distribution, i.e., $\mu_i^k$ denotes the mean value and $\sigma_i^k$ the variance, respectively.
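
As a concrete reference point, the following NumPy sketch evaluates this bimodal density. It anticipates the Gaussian form of Eqs. (6.51)–(6.52) below, treats `sigma` as a variance per the definition above, and all function and array names are illustrative rather than from the original text.

```python
import numpy as np

def component_pdf(w, mu, sigma):
    # Normalized Gaussian component p(w | Theta_i^k) = f / Omega,
    # cf. Eqs. (6.51)-(6.52); sigma holds *variances*, as in the text.
    w = np.asarray(w, dtype=float)[:, None]          # shape (m_i, 1)
    f = np.exp(-(w - mu) ** 2 / (2.0 * sigma))       # Eq. (6.52)
    return f / np.sqrt(2.0 * np.pi * np.abs(sigma))  # divide by Omega

def mixture_pdf(w, beta, mu, sigma):
    # Eq. (6.49): two-component bimodal density over the layer weights.
    return (beta * component_pdf(w, mu, sigma)).sum(axis=1)
```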

To solve the GMM with the observed data $w_i$, i.e., the weight ensemble of the $i$-th layer, we introduce the hidden variable $\xi_i^{jk}$ to formulate the maximum likelihood estimation (MLE) of the GMM as

$$\xi_i^{jk} = \begin{cases} 1, & w_i^j \in p_i^k \\ 0, & \text{otherwise}, \end{cases} \tag{6.50}$$

where ξjk

i

is the hidden variable that describes the affiliation of wj

i and pk

i (simplified deno-

tation of p(wi|Θk

i )). We then define the likelihood function P(wj

i , ξjk

i |Θk

i ) as

$$P(w_i^j, \xi_i^{jk} \mid \Theta_i^k) = \prod_{k=1}^{2} (\beta_i^k)^{|p_i^k|} \prod_{j=1}^{m_i} \left[ \frac{1}{\Omega} f(w_i^j, \mu_i^k, \sigma_i^k) \right]^{\xi_i^{jk}}, \tag{6.51}$$

where $\Omega = \sqrt{2\pi |\sigma_i^k|}$, $|p_i^k| = \sum_{j=1}^{m_i} \xi_i^{jk}$, and $m_i = \sum_{k=1}^{2} |p_i^k|$. The function $f(w_i^j, \mu_i^k, \sigma_i^k)$ is defined as

$$f(w_i^j, \mu_i^k, \sigma_i^k) = \exp\left( -\frac{(w_i^j - \mu_i^k)^2}{2\sigma_i^k} \right). \tag{6.52}$$
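
Continuing the sketch above, the complete-data likelihood of Eq. (6.51) can be evaluated in log form for numerical stability; the argument `xi` is a hypothetical one-hot matrix holding the assignments $\xi_i^{jk}$.

```python
def complete_loglik(w, xi, beta, mu, sigma):
    # Log of Eq. (6.51) for a one-hot assignment matrix xi of shape (m_i, 2).
    w = np.asarray(w, dtype=float)[:, None]
    p_sizes = xi.sum(axis=0)                               # |p_i^k| = sum_j xi_i^{jk}
    log_f = -(w - mu) ** 2 / (2.0 * sigma)                 # log f, Eq. (6.52)
    log_omega = 0.5 * np.log(2.0 * np.pi * np.abs(sigma))  # log Omega
    return (p_sizes * np.log(beta)).sum() + (xi * (log_f - log_omega)).sum()
```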

Hence, for every single weight $w_i^j$, $\xi_i^{jk}$ can be computed by maximizing the likelihood as

$$\max_{\xi_i^{jk},\, \forall j,k} \; E\left[ \log P(w_i^j, \xi_i^{jk} \mid \Theta_i^k) \;\middle|\; w_i^j, \Theta_i^k \right], \tag{6.53}$$

where $E(\cdot)$ denotes the conditional expectation. Therefore, the maximum likelihood estimate $\hat{\xi}_i^{jk}$ is calculated as

$$\hat{\xi}_i^{jk} = E(\xi_i^{jk} \mid w_i^j, \Theta_i^k) = P(\xi_i^{jk} = 1 \mid w_i^j, \Theta_i^k) = \frac{\beta_i^k \, p(w_i^j \mid \Theta_i^k)}{\sum_{k'=1}^{2} \beta_i^{k'} \, p(w_i^j \mid \Theta_i^{k'})}. \tag{6.54}$$
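
A minimal E-step in the same illustrative NumPy style, reusing `component_pdf` from the earlier sketch:

```python
def e_step(w, beta, mu, sigma):
    # Eq. (6.54): posterior responsibilities xi_hat_i^{jk}, shape (m_i, 2).
    num = beta * component_pdf(w, mu, sigma)   # beta_i^k * p(w_i^j | Theta_i^k)
    return num / num.sum(axis=1, keepdims=True)
```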

After the expectation step, we perform the maximization step to compute $\Theta_i^k$ as

$$\hat{\mu}_i^k = \frac{\sum_{j=1}^{m_i} \hat{\xi}_i^{jk} w_i^j}{\sum_{j=1}^{m_i} \hat{\xi}_i^{jk}}, \tag{6.55}$$

$$\hat{\sigma}_i^k = \frac{\sum_{j=1}^{m_i} \hat{\xi}_i^{jk} (w_i^j - \hat{\mu}_i^k)^2}{\sum_{j=1}^{m_i} \hat{\xi}_i^{jk}}, \tag{6.56}$$

$$\hat{\beta}_i^k = \frac{\sum_{j=1}^{m_i} \hat{\xi}_i^{jk}}{m_i}. \tag{6.57}$$
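
The corresponding M-step updates can be sketched as below, wrapped in a small EM driver that alternates Eq. (6.54) with Eqs. (6.55)–(6.57); the symmetric $\pm$ initialization is an assumption for illustration, not part of the original text.

```python
def m_step(w, xi):
    # Eqs. (6.55)-(6.57): update means, variances, and mixing weights.
    w = np.asarray(w, dtype=float)
    n_k = xi.sum(axis=0)                                     # sum_j xi_hat_i^{jk}
    mu = (xi * w[:, None]).sum(axis=0) / n_k                 # Eq. (6.55)
    sigma = (xi * (w[:, None] - mu) ** 2).sum(axis=0) / n_k  # Eq. (6.56)
    beta = n_k / w.size                                      # Eq. (6.57)
    return mu, sigma, beta

def fit_bimodal_gmm(w, n_iter=50):
    # Alternate E- and M-steps; the symmetric +/- init reflects the
    # assumed bimodal shape of the layer weights (an illustration).
    w = np.asarray(w, dtype=float)
    mu = np.array([-1.0, 1.0]) * np.abs(w).mean()
    sigma = np.full(2, w.var())
    beta = np.full(2, 0.5)
    for _ in range(n_iter):
        xi = e_step(w, beta, mu, sigma)
        mu, sigma, beta = m_step(w, xi)
    return mu, sigma, beta
```

For instance, calling `fit_bimodal_gmm(weights.ravel())` on the flattened weights of the $i$-th layer would return estimates of $\{\hat{\mu}_i^k, \hat{\sigma}_i^k, \hat{\beta}_i^k\}$ for the two components.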